
Task Fusion #113

Open
wants to merge 57 commits into base branch-24.03

Conversation

@shivsundram commented Dec 6, 2021

PR for Task Fusion

Task Fusion PR guide

This PR implements:

  1. Task Fusion in the Legate Core
    i. Fusability determination and creation of the fused ops, done in the Python code (see /core/runtime.py)
    ii. Issuing of a fused task's sub-operations, currently done in cuNumeric C++ code. This can be viewed in the companion PR Task Fusion, Constant Conversion Optimization, and 27pt stencil benchmark cupynumeric#150
    iii. An interface for accessing each registered Legate task's function pointer, used for efficiently launching a fused task's sub-tasks
  2. Basic constant optimization
    i. Eliminates issuing of expensive convert tasks for static/constant scalars (i.e., there is no need to defer execution of this type conversion when the underlying Future's value is a constant embedded in the code)
    ii. This lives in cuNumeric, but the changes can be viewed here (will be contributed in a separate PR)
  3. Some changes to the benchmarks
    i. Re-organization of the synchronizations in black_scholes. Instead of checking that calls and puts != NaN at the end of every benchmarking run, we check this only after all benchmarking runs have concluded.
    ii. A new 27-point stencil benchmark, demonstrating how fusion boosts performance even though it theoretically decreases task-level parallelism
    iii. Lives in cuNumeric, but can be viewed in the companion PR here
  4. TO_REMOVE: A C++ utility API for serializing a Legate task, for issuing a Legate task from another Legate task; this will be removed from this PR, so feel free to ignore this code for now

Required for merge (i.e., TODO):
- Test OpenMP support

Note
I am currently unable to move the Fused Task into the core, as doing so causes issues with the fused task's mapper tag. Until then, the Fused Task will live in cuNumeric, in the companion PR nv-legate/cupynumeric#150.

Overview of Optimizations

Task Fusion
Given a window of buffered Legate tasks, fusion finds sub-windows of operations that can each be aggregated into a single launch/fused operation. In other words, given a 100% fusable window of N tasks, the runtime would normally issue O(N) meta/utility tasks (mapping calls, scheduling, etc.) for this window, whereas the fused operation requires only O(1) such meta-tasks, thus reducing overhead. There are currently four constraints dictating op fusability, plus one temporary constraint:

Fusability Constraints
1. Fusable Operations: Only a certain subset of Legate tasks is fusable (e.g., unary, binary, and ternary ops)
2. Producer-Consumer: Within the same fused op, one cannot write to one store representing an underlying buffer/regionfield and then read from a different store representing the same underlying buffer/regionfield. This prevents invalid combinations of aliased partitions in a fused operation (e.g., the reads, adds, and mults in an in-place stencil are fusable, but not the write/update of the buffer at the end of the stencil iteration)
3. Launch Shape Equivalence: All sub-operations in a fused task must have the same launch shape.
4. Projection Functor Equivalence: If a store is used by multiple sub-ops in a fused task, every sub-op's functor for indexing into that store must be equivalent.
5. (Temporary) Existence of cuNumeric: Until the FusedTask is moved to the core, we can only fuse if the cuNumeric context exists (note that the Fused Task being in cuNumeric does not preclude it from fusing non-cuNumeric tasks)

All constraints live in core/runtime. Each constraint's apply function applies that constraint to a window of ops. This will be replaced (by Dec 8) with a new set of functions that apply constraints more efficiently (most of these are already in the PR and currently called apply2, but will eventually just be called apply).
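For concreteness, here is a minimal sketch of this apply interface; the class name, the launch_shape attribute, and the interval-intersection helper are illustrative, not the exact code in core/runtime.py. Each constraint splits the window into sub-intervals it considers fusable, and the checker combines the intervals returned by all registered constraints.

class LaunchShapeEquivalence:
    def apply(self, ops):
        """Split the window wherever the launch shape changes; each returned
        (start, end) interval is fusable under this constraint."""
        intervals, start = [], 0
        for i in range(1, len(ops)):
            if ops[i].launch_shape != ops[start].launch_shape:
                intervals.append((start, i))
                start = i
        intervals.append((start, len(ops)))
        return True, intervals

def intersect_intervals(a, b):
    """Common refinement of two constraints' contiguous intervals: an op range
    stays fusable only if it lies inside an interval of both lists (sketch)."""
    cuts = sorted({p for s, e in a + b for p in (s, e)})
    return list(zip(cuts, cuts[1:]))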

Currently, Constraint 4 is the most expensive to run, and it doesn't seem to ever trigger. If we can come up with a better way of testing this, the runtime will improve further.

Constant Optimization
Currently implemented in cunumeric/array.py when issuing binary_ops. Instead of creating a convert operation to convert a known scalar constant (e.g., from fp64 to fp32), we simply create the scalar in place and eliminate the need for said operation. Not only does this enable larger levels of fusion (as convert is not a fusable op), but eliminating the conversion itself provides a large performance boost. Note this only triggers when the constant is a DeferredArray (not Eager, as this optimization actually slows down ops using EagerArrays).
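A rough sketch of the idea (this is not the actual cunumeric/array.py code; the helper name and arguments are illustrative):

import numpy as np

def coerce_constant_operand(value, target_dtype, other_is_deferred):
    """Sketch: if one operand is a scalar constant known at trace time and the
    other operand is a DeferredArray, cast the constant in place instead of
    emitting a deferred Convert task (Convert is not fusable, so skipping it
    also widens the fusable windows)."""
    if np.isscalar(value) and other_is_deferred:
        # e.g. an fp64 literal used with an fp32 array
        return np.array(value, dtype=target_dtype)
    return None  # fall back to the normal path, which issues a Convert task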

Performance:

Depending on input sizes, GPU counts (1-16 currently tested), and GPU type (Pascal and Volta currently tested), we see:
~1.6-2x speedup for black scholes
~1.6-2x speedup for the 27-point stencil
~1.15-2x speedup for Logistic Regression
~1.15 speedup for Jacobi Iteration
No statistically significant speedup for conjugate gradient (between 1.0x-1.05x)

The maximum window_size in both the fused and unfused baseline (i.e., the nv-legate reference) is set to 50 in these tests. Speedup is even higher when the baseline uses a window_size of 1 (the current setting in the reference) while the fused code still uses n=50 (the current setting in this PR).

Testing

Tests are currently running (I have been testing as I go). I have not seen a single instance where fused and unfused execution generate different outputs.

Known Issues:

  1. In one benchmark (conjugate gradient, only noticed on Lassen, not Sapling), we see <0.5% performance degradation with fusion turned on. This is likely not statistically significant, but if it is a concern I can design an additional constraint to prevent fusion here. Very little gets fused in this example, so the overhead of creating the fused op becomes more tangible (specifically, the few fused ops are all O(n), while the unfusable ops are all O(n^2)).
  2. I have seen one instance where a fused task hangs, specifically at node counts >= 4 on Lassen for logistic regression (for smaller node counts it works fine). On my branch this happens with fusion turned off as well, and is likely due to one of the smaller changes I made outside the fusion code. I'm investigating. It is worth noting that with the given problem size, the reference implementation actually generates "inf" floating point values.

#self.validIDs.add(5) #Convert op
#self.validIDs.add(7) #dot op
#self.terminals.add(7) #dot op is only fusable as a terminal op in a window

Author:

I initially introduced the notion of a 'terminal' op that is fusable only at the end of a window. I don't use this, so I will remove it.

intervals.append((start,end))
#print("proj", intervals)
return True, intervals

Author:

As stated, the above constraint isn't particularly cheap, and it doesn't seem to ever be "used". I will look into the perf, but I suspect the get_projection method is the culprit here.

fusion_checker.register_constraint(IdenticalProjection())
fusion_checker.register_constraint(ValidProducerConsumer())
can_fuse, fusable_sets, partitions = fusion_checker.can_fuse()

Author:

fusion constraints are registered above

print("launch unfused input", op._task_id, input)
self.propogateFuture(op)
op.launch(op.strategy)

Author:

the above func will be cleaned up and simplified in the next day or so

if store._key_partition is not None:
    partition = store._key_partition
else:
    partition = store.compute_key_partition(restrictions)
Author:

If the store already has a key partition (set by the fuser), there's no need to compute it.
There may be a more elegant way of doing this.

Contributor:

You should check if the key partition satisfies the restrictions before you reuse it. And I think this logic should be moved to compute_key_partition. Just so you know, I'm currently refactoring the code that manages legate stores, so don't make any micro optimization like this, as that will become obsolete in the near future.
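For concreteness, a sketch of this suggestion (the helper names satisfies_restrictions and _invent_key_partition are hypothetical; the real Store code differs):

def compute_key_partition(self, restrictions):
    cached = self._key_partition
    # Reuse the fuser-provided key partition only if it still satisfies the
    # restrictions of the current launch; otherwise fall back to computing one.
    if cached is not None and cached.satisfies_restrictions(restrictions):
        return cached
    self._key_partition = self._invent_key_partition(restrictions)
    return self._key_partition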

fprintf(stderr, "Found an uninitialized Legate store\n");
assert(false);
//fprintf(stderr, "Found an uninitialized Legate store\n");
//assert(false);
Author (@shivsundram, Dec 7, 2021):

I will come back to this; this was done to make constant optimization work; I will check if I indeed need this assertion removed/modified

@@ -794,8 +794,8 @@ def driver():
)
Author:

all changes to this file will be reverted before merge

name="legate.core",
version="0.1",
name="legate-core",
version="21.10.00",
Author:

Is the current version indeed 0.1? If so I'll undo the above change

def build_task(self, launch_domain, argbuf):
    self._req_analyzer.analyze_requirements()
    self._out_analyzer.analyze_requirements()

    # pack fusion metadata
    self.pack_fusion_metadata(argbuf, self._is_fused, self._fusion_metadata)
Contributor:

I'm not a fan of this design. The fusion task shouldn't be any different from a normal Legate task. I believe the fusion metadata can be passed as a bunch of scalar arguments to the fusion task. If we do that, then we don't need to make every place in the core handle fusion. Making the default path aware of task fusion doesn't sound like a good design.

Author (@shivsundram, Dec 14, 2021):

This is definitely worth discussing (I pondered this myself):
I did think about making them scalars, but was not sure whether there were other implications/downsides of storing all the metadata that way (e.g., is there a maximum number of scalars we can give a task?), since we'd be sending quite a few scalars to the task (or rather multiple lists, each of which can have >50 scalars; the number of scalars in 4 of the 5 lists is equal to the fused op's length, even with potential deduplication of stores).
If we did this, the task's scalars array would mix "metadata" scalars (which are really a bunch of lists) with the actual scalars used by the sub-tasks, which seemed not to adhere to the abstraction of having a dedicated "scalars" array in the context data structure. I therefore chose to serialize it as Legate task argument data. As you stated, the downside is that the core/default path has to be "fusion aware".

Author (@shivsundram, Dec 14, 2021):

But if you don't see any potential downsides to making them all scalars, then it could totally be implemented to be scalar-based. We would mostly just have to move the book-keeping logic from the deserializer into the Fused Legate Task.

Contributor:

I feel what's actually missing here is the ability to pass an arbitrary chunk of data (probably a memoryview object in Python) to a task, which is in theory possible, so we should probably extend the buffer builder interface to take an arbitrary memoryview object and pack it as a single contiguous chunk. The assumption here is that the task knows how to interpret those bytes. It shouldn't be too difficult for me to add it.
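A self-contained sketch of that extension (this is not legate's actual BufferBuilder; the class and method names are illustrative): the memoryview is packed as a length-prefixed contiguous chunk, and the receiving task interprets the bytes itself.

import struct

class SketchBufferBuilder:
    """Toy stand-in for a buffer builder; only the memoryview packing matters."""

    def __init__(self):
        self._chunks = []

    def pack_32bit_uint(self, value):
        self._chunks.append(struct.pack("<I", value))

    def pack_buffer(self, view):
        data = bytes(view)               # force a contiguous copy of the bytes
        self.pack_32bit_uint(len(data))  # length prefix so the task can locate/skip the chunk
        self._chunks.append(data)        # raw bytes; the task knows how to interpret them

    def get_bytes(self):
        return b"".join(self._chunks)

With something like this, the fused task's metadata lists could be packed once as a single opaque chunk (e.g., via a numpy array's .data memoryview) without teaching the core deserializer about fusion.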

TaskContext(const Legion::Task* task,
const std::vector<Legion::PhysicalRegion>& regions,
Legion::Context context,
Legion::Runtime* runtime);

TaskContext(const Legion::Task* task, const std::vector<Legion::PhysicalRegion> regions)
Contributor:

Why is this constructor necessary?

Author (@shivsundram, Dec 14, 2021):

Each sub-op is called like this (via function-pointer):

for (int i = 0; i < fused_op_len; i++) {
      auto descriptor = Core::gpuDescriptors.find(opIDs[i])->second;
      descriptor(context);
}

However, the context above is not the same as the parent task's context. Whereas the parent task's context contains the "union" of all inputs, outputs, reductions, and scalars over all sub-ops, the above context should only include the inputs, outputs, etc. for sub-operation/task i. I used this constructor to build the operation-specific contexts, which hold each sub-op's specific set of inputs, outputs, etc.

This brings up a potential issue in that the new context's task and regions fields are currently not specific to the sub-op. I'm not sure how much of an issue that is, as I only need the context structure to hold stores and scalars for the sub-ops. I wanted to avoid any performance cost of creating new Task objects in the Fused Task, so I elected to keep the new context object as is.

Contributor:

You do need a new context in this case, but my question is more about why you couldn't use the existing constructor. The constructor you added leaves the runtime and context members uninitialized, leading to a crash if the child task tries to access the runtime (though no task in cuNumeric is actually doing that). I think the context you are creating here should still have a valid runtime and context pointer.

That said, I think you actually need a slightly different change here, one that allows TaskContext to override the Legion task's id. The task object here belongs to the fusion task, so if you pass it as-is to a child task, the child task may misidentify itself. I believe no cuNumeric task is actually doing this, but leaving it unaddressed can be a footgun later on.

Author:

Yeah, it's a valid point. I initially tried using the normal constructor and got runtime errors from the task deserializer. Recreating the issue, I get the following (the error is from span.h: Assertion `pos < size_' failed):

legion_python: /home/shiv1/fmultiGPU/legate.core/src/core/utilities/span.h:38: decltype(auto) legate::Span<T>::operator[](size_t) [with T = const Legion::PhysicalRegion; size_t = long unsigned int]: Assertion `pos < size_' failed.
Signal 6 received by node 0, process 3131805 (thread 7f8d47bf3000) - obtaining backtrace
Signal 6 received by process 3131805 (thread 7f8d47bf3000) at: stack trace: 19 frames
  [0] = /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7f8d70ce33c0]
  [1] = /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb) [0x7f8d707f718b]
  [2] = /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b) [0x7f8d707d6859]
  [3] = /lib/x86_64-linux-gnu/libc.so.6(+0x25729) [0x7f8d707d6729]
  [4] = /lib/x86_64-linux-gnu/libc.so.6(+0x36f36) [0x7f8d707e7f36]
  [5] = /home/shiv1/fmultiGPU/legate.core/install/lib/liblgcore.so(+0x5a95d) [0x7f8d4428595d]
  [6] = /home/shiv1/fmultiGPU/legate.core/install/lib/liblgcore.so(legate::TaskDeserializer::_unpack(legate::Store&)+0x10e) [0x7f8d44286d9e]
  [7] = /home/shiv1/fmultiGPU/legate.core/install/lib/liblgcore.so(legate::TaskContext::TaskContext(Legion::Task const*, std::vector<Legion::PhysicalRegion, std::allocator<Legion::PhysicalRegion> > const&, Legion::Internal::TaskContext*, Legion::Runtime*)+0x7f6) [0x7f8d44278626]
  [8] = /home/shiv1/fmultiGPU/legate.core/install/lib/libcunumeric.so(cunumeric::FusedOpTask::gpu_variant(legate::TaskContext&)+0x6e2) [0x7f8d1ce0a852]

Author (@shivsundram, Dec 14, 2021):

At the time I didn't spend much time debugging this and just wanted something that worked for the time being (the constructor I made doesn't call the deserializer; I didn't want to re-serialize/deserialize things anyway because it involves additional overhead). I'm guessing it has to do with the number of futures or scalars passed in, but I can totally dive a little deeper into what's causing this.

Contributor (@magnatelee, Dec 14, 2021):

I suspect the crash is due to the fusion metadata you added that the current serializer doesn't understand, so this issue is tied to the fusion awareness introduced in the code base, which I suggest removing. Actually, we shouldn't be deserializing the task argument for the child task context, so you're right that there should be a constructor that doesn't deserialize the argument. I guess we could add an argument to the existing constructor to control the behavior.

@@ -121,13 +130,14 @@ class TaskContext {
public:
ReturnValues pack_return_values() const;

private:
public:
Contributor:

Why did you want to expose the private members here?

Author (@shivsundram, Dec 14, 2021):

Ahh, this should be reverted. This is an artifact from when we were using the inline launcher and I needed access to these members.

input_start, output_start, offset_start = 0, 0, 0
reduction_start, scalar_start, future_start = 0, 0, 0

for op in ops:
Contributor:

I guess we'd eventually want coalescing of arguments here as well. Right now, the fusion code passes the same store used by multiple tasks multiple times, hoping that the launcher at least coalesces region requirements for those duplicates. But, we can identify that the same store is being used in multiple tasks when we scan them for dataflow analysis. So, we could potentially pass a deduplicated list of stores to the fusion task with a mapping between the original indices and those in the store list that is passed to the fusion task.

Author (@shivsundram, Dec 14, 2021):

Yeah, we could deduplicate these, and I don't imagine this being too difficult. We would just have to place the offset of the deduplicated store (e.g., in line 1145) in the offsets array, and it should just work.
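A sketch of that mapping (the structure is hypothetical; real code would also track outputs and reductions, and would key on the underlying RegionField/storage rather than Python object identity):

def deduplicate_stores(ops):
    """Pass each distinct store once, and give every sub-op a list of offsets
    into the deduplicated store list."""
    store_list, index_of, op_offsets = [], {}, []
    for op in ops:
        offsets = []
        for store in op.inputs:
            key = id(store)
            if key not in index_of:
                index_of[key] = len(store_list)
                store_list.append(store)
            offsets.append(index_of[key])
        op_offsets.append(offsets)
    return store_list, op_offsets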

@magnatelee (Contributor):
Thanks a lot for all your hard work on this PR. I've left several comments and questions, but you don't necessarily need to address all of them. I'll pick this up probably after the break and we can work together to make this ready for the merge.

ty.uint32,
)
ty.uint32,)
self._window_size = 50
Author:

This shouldn't be hardcoded; I will remove it.

@marcinz (Collaborator) commented Jan 26, 2023:

@shivsundram @magnatelee What is the status of this PR?

@magnatelee (Contributor):
@marcinz I'm planning to port/rebase this once the planned refactoring on the core is done.

@marcinz changed the base branch from branch-22.01 to branch-24.01 on November 9, 2023 16:54
@marcinz changed the base branch from branch-24.01 to branch-24.03 on February 22, 2024 00:57